
azurerm_firewall_policy_rule_collection_group - fix timeout expiring while waiting for policy lock#32081

Closed
sanderaernouts wants to merge 1 commit into hashicorp:main from sanderaernouts:fix/firewall-policy-rcg-timeout-before-lock

Conversation

@sanderaernouts

@sanderaernouts sanderaernouts commented Mar 31, 2026

Summary

  • Fix timeout-before-lock ordering in azurerm_firewall_policy_rule_collection_group CreateUpdate and Delete functions
  • Prevents operations from timing out when multiple rule collection groups target the same firewall policy

Problem

We recently started seeing timeouts when deploying multiple azurerm_firewall_policy_rule_collection_group resources that target the same firewall policy, because Terraform processes them in parallel. Azure enforces serial processing on the same firewall policy, responding with 409 AnotherOperationInProgress to concurrent requests.

The provider already serializes these operations using locks.ByName(), which prevents the 409 errors. However, the timeout context (context.WithTimeout with a 30-minute deadline) was created before the lock was acquired:

ctx, cancel := timeouts.ForCreateUpdate(...)  // 30-minute timer STARTS HERE
// ...
locks.ByName(...)                              // BLOCKS here, timer keeps ticking
// ...
client.CreateOrUpdateThenPoll(ctx, id, param)  // ctx may have insufficient time remaining

When N goroutines compete for the same lock, all N timers start simultaneously. Goroutines that wait for the lock have their timeout budget consumed by wait time, eventually causing context.DeadlineExceeded errors that manifest as operation timeouts.

Fix

Move lock acquisition before the timeout context creation, so each operation gets its full timeout regardless of how long it waited for the lock:

locks.ByName(...)                              // BLOCKS here, no timer yet
// ...
ctx, cancel := timeouts.ForCreateUpdate(...)  // 30-minute timer starts AFTER lock acquired
// ...
client.CreateOrUpdateThenPoll(ctx, id, param)  // full 30 minutes available

Applied to both resourceFirewallPolicyRuleCollectionGroupCreateUpdate and resourceFirewallPolicyRuleCollectionGroupDelete.

Side effects

  • The existence check (client.Get) in CreateUpdate is now inside the lock scope

Note

All 6 firewall resource files have this same timeout-before-lock pattern. This PR only fixes firewall_policy_rule_collection_group_resource.go. The same fix can be applied to other resources separately if needed.

…while waiting for policy lock

When multiple firewall policy rule collection groups target the same
firewall policy, Terraform processes them in parallel but Azure enforces
serial processing (responding with 409 AnotherOperationInProgress).

The provider already serializes these operations using locks.ByName(),
but the timeout context was created before the lock was acquired. This
meant time spent waiting for the lock consumed the 30-minute timeout
budget, causing later operations to fail with context.DeadlineExceeded.

Move lock acquisition before the timeout context creation in both
CreateUpdate and Delete so each operation gets its full timeout budget
regardless of how long it waited for the lock.
@sanderaernouts sanderaernouts marked this pull request as ready for review March 31, 2026 14:54
@sanderaernouts sanderaernouts requested review from a team, WodansSon and magodo as code owners March 31, 2026 14:54
@mstroob
Contributor

mstroob commented Apr 1, 2026

Thanks for the PR. However, the timeouts should apply to the create/update operation, not to a specific API call. If you need to increase the timeout for your resource specifically, see https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/firewall_policy_rule_collection_group#timeouts

@mstroob mstroob closed this Apr 1, 2026
@sanderaernouts
Author

sanderaernouts commented Apr 2, 2026

@mstroob I see your point. However, Azure does not allow concurrent updates to firewall resources, yet the azurerm provider tries to concurrently create or update rule collection groups on the same firewall policy, for example. Lately the updates for rule collection groups have been taking longer and longer, causing us to hit the timeout.

The current locking mechanism seems like a workaround for this. Because the timeout context is created before the lock is acquired, the timeout effectively applies to all preceding rule collection group updates as well.

Increasing the timeout is not a real fix for the problem. We now often see updates taking 10–15 minutes, so a policy with, say, 10 rule collection groups that require an update would force me to set a 150-minute timeout to work around the fact that the provider attempts concurrent operations on a resource that does not allow them.

My PR was an attempt to "fix" this, but it would indeed still cause create/update operations to "hang" and now without a timeout. Do you see another way to fix this? The real fix in my opinion would be that the provider does not attempt parallel create/update/delete operations on the same firewall policy ID to begin with. In that case the lock can be removed here as well because the locking is done on a higher level.

@sanderaernouts
Author

sanderaernouts commented Apr 2, 2026

@mstroob here is an alternate implementation in a draft PR, #32094. Though technically very similar, it leverages ...WithoutTimeout, which seems to be the pattern the SDK recommends for cross-resource synchronization:

"there are cases where operation synchronization across concurrent resources is necessary in the resource logic, such as a mutex, to prevent remote system errors. Since these operations would have an indeterminate timeout that scales with the number of resources, this allows resources to control timeout behavior."

-- CreateWithoutTimeout documentation (source)

The timeout still covers the full API operation (create/update/delete + polling). It just starts after the mutex is acquired, so lock contention across concurrent rule collection groups on the same policy doesn't eat into the timeout budget.

@mstroob
Contributor

mstroob commented Apr 8, 2026

Hi @sanderaernouts, I would suggest checking the issues to see if there is a similar one, or opening a new one with the required details, especially a way to reproduce the issue. There may be another underlying problem causing the very slow updates that could be fixed, rather than just removing the timeout.
